-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: generalized wildcard queries #590
Conversation
This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog. Note that this might contain changes that are on main, but not yet released. Changelog: 0.3.0 (2024-10-16)⚠ BREAKING CHANGES
FeaturesBug Fixes |
By the way, the majority of the line-count (12634 lines) is the new (more-verbose) lineage definition file: testBaseData/exampleDataset/lineage_definitions.yaml |
886c2aa
to
7e94a82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message should reference the issue (ideally with resolves #...
in the footer - then release please will automatically pick it up and mention it in the changelog)
Also this is a breaking change - the commit message should mention it.
Also I wonder whether it might be good to still have a metadata type lineage
or similar? Would there be any advantages?
scripts/alias2lineageDefinitions.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this quite complicated. Wouldn't it be easier to do it in plain Python?
- You get better IDE support compared to when writing SQL
- We would not need to DuckDB dependency in Python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChatGPT gave me this, which almost works:
import json
import argparse
import yaml
import shutil
import sys
import os
import tempfile
# Define the argument parser
parser = argparse.ArgumentParser(description="Process a JSON file and convert it to a DataFrame.")
parser.add_argument('alias_key', type=str, help='Path to the alias_key in JSON format')
parser.add_argument('lineage_file', type=str, help='Path to the input_file containing all lineages')
parser.add_argument('--preserve-tmp-dir', action='store_true', help='Preserve the temporary directory to keep the intermediate files')
# Parse the arguments
args = parser.parse_args()
# Load the JSON data from the file path provided as argument
with open(args.alias_key, 'r') as file:
alias_key = json.load(file)
# Create a list to store the reformatted data
reformatted_data = []
# Loop through the JSON and format it as desired
for key, value in alias_key.items():
if value and isinstance(value, str):
reformatted_data.append({"name": key, "alias": value})
else:
reformatted_data.append({"name": key, "alias": key})
# Load the lineage data from the file path provided as argument
with open(args.lineage_file, 'r') as file:
lineages = file.read().splitlines()
# Create a dictionary to store the alias mappings
alias_dict = {item['name']: item['alias'] for item in reformatted_data}
# Function to unalias a lineage
def unalias_lineage(lineage):
parts = lineage.split('.')
if parts[0] in alias_dict:
parts[0] = alias_dict[parts[0]]
return '.'.join(parts)
# Unalias all lineages
unaliased_lineages = {lineage: unalias_lineage(lineage) for lineage in lineages}
# Function to find the immediate parent lineage
def find_immediate_parent(lineage):
if '.' in lineage:
return lineage.rsplit('.', 1)[0]
return None
# Create the lineage definitions
lineage_definitions = {}
for lineage, unaliased in unaliased_lineages.items():
parent = find_immediate_parent(unaliased)
if parent:
parent_unaliased = unalias_lineage(parent)
lineage_definitions[lineage] = {"parents": [parent_unaliased]}
else:
lineage_definitions[lineage] = {"parents": []}
# Transform the data into YAML
yaml.dump(lineage_definitions, sys.stdout, default_flow_style=False)
if args.preserve_tmp_dir:
temp_dir = tempfile.mkdtemp()
print(f"Temporary directory: {temp_dir}", file=sys.stderr)
else:
temp_dir = tempfile.mkdtemp()
shutil.rmtree(temp_dir)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the SQL based script is less error-prone but effectively equivalent. I really like that I can just look at the intermediate tables and see the precise output of every single step in the script. What would the finished alternative look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that I need to look at SQL statements to understand the code. Probably I'm just a lot less used to SQL :D
The finished script would look quite similar I guess. It resolves an alias in a couple places where it should print the alias instead (A.1.2.3.4 instead of AA.4). I assume it's easy to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a hybrid and moved the ugly SQL part into python :)
The remaining part would be much more error prone (implementing the first part already demonstrated that to me :D) and a lot more verbose in my opinion
406b12f
to
561327d
Compare
94f714f
to
93733a8
Compare
ff6cb04
to
c1cb660
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually looks good, but it's not quite ready to be merged yet.
Also please make sure that the changelog contains the issue.
f3b1267
to
422d2e2
Compare
1e2220b
to
b9a24c7
Compare
8222885
to
b9a24c7
Compare
I created GenSpectrum/LAPIS#978 to update the documentation. |
b9a24c7
to
71120a7
Compare
BREAKING CHANGES: The preprocessing config field pangoLineageDefinitionFilename has been renamed to lineageDefinitionFilename. We now accept a YAML lineage definition file instead of a Pango alias key. A script (scripts/alias2lineageDefinitions.py) is provided to transform a list of lineages and an alias into the required file format. Input and query validation now checks whether the provided lineage exists in the defined lineages, and errors are thrown if validation fails. resolves #458
5807476
to
0336600
Compare
0336600
to
0b2a670
Compare
resolves #458
Summary
This PR resolves the generalized lineage refactor. Currently the system only worked with lineages following the pango lineage format as it was designed for SARS-CoV-2.
Now this is generalized and just receives an arbitrary lineage definition file which outlines the parent-child relations for the provided lineages.
Breaking Changes:
scripts/alias2lineageDefinitions.py
that transforms a list of lineages and an alias into the requested file formatIn the current version, we do not allow alias queries. This means that the user-provided query needs to e.g. contain a valid pango lineage for the respective instances. Searching for
B.1.1.529.1
does not work and instead a search forBA.1
is requiredPR Checklist